# Chinese Image-Text Retrieval
## Chinese Clip Vit Large Patch14 336px
Chinese CLIP is an implementation of CLIP trained on approximately 200 million Chinese image-text pairs, using ViT-L/14@336px as the image encoder and RoBERTa-wwm-base as the text encoder.
Text-to-Image · Transformers
OFA-Sys · 713 · 23
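These CLIP-style models score text-to-image retrieval by cosine similarity between L2-normalized embeddings from the two encoders. A minimal NumPy sketch of that scoring step, with random placeholder embeddings standing in for real encoder outputs (the dimension and variable names here are illustrative, not the model's actual API):

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder embeddings: in practice these come from the text and image
# encoders (e.g. RoBERTa-wwm-base and ViT-L/14); 768 is an illustrative dim.
text_emb = rng.standard_normal((1, 768))    # one query caption
image_embs = rng.standard_normal((5, 768))  # five candidate images

def l2_normalize(x, axis=-1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

# Cosine similarity reduces to a dot product after L2 normalization.
sims = l2_normalize(text_emb) @ l2_normalize(image_embs).T  # shape (1, 5)

best = int(np.argmax(sims))  # index of the retrieved image
print(sims.shape, best)
```

With real model outputs, ranking candidates by this similarity is all that image-text retrieval requires at inference time.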
## Chinese Clip Vit Base Patch16
The base version of Chinese CLIP, using ViT-B/16 as the image encoder and RoBERTa-wwm-base as the text encoder, trained on a large-scale dataset of approximately 200 million Chinese image-text pairs.
Text-to-Image · Transformers
OFA-Sys · 49.02k · 104
## Taiyi CLIP RoBERTa 326M ViT H Chinese
The first open-source Chinese CLIP model, pre-trained on 123 million image-text pairs, with a RoBERTa-large architecture as the text encoder.
Text-to-Image · Transformers · Chinese · Apache-2.0
IDEA-CCNL · 108 · 10
## Taiyi CLIP Roberta Large 326M Chinese
The first open-source Chinese CLIP model, pre-trained on 123 million image-text pairs, supporting Chinese image-text feature extraction and zero-shot classification.
Text-to-Image · Transformers · Chinese · Apache-2.0
IDEA-CCNL · 10.37k · 39
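The zero-shot classification these cards mention reduces to encoding each class name in a prompt template and taking a softmax over the image's similarity to every label embedding. A sketch of that scoring step, assuming the embeddings have already been produced by the encoders (the values below are random placeholders, and the Chinese prompt template and logit scale of 100 follow common CLIP practice rather than any one model's documented setup):

```python
import numpy as np

labels = ["猫", "狗", "飞机"]  # cat, dog, airplane
prompts = [f"一张{label}的照片" for label in labels]  # "a photo of a {label}"

rng = np.random.default_rng(1)
# Placeholders for encoder outputs: one image embedding and one text
# embedding per prompt; real values would come from the model.
image_emb = rng.standard_normal(512)
text_embs = rng.standard_normal((len(prompts), 512))

def l2_normalize(x, axis=-1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

# Similarity logits, scaled by CLIP's usual learned temperature (~100).
logits = 100.0 * (l2_normalize(text_embs) @ l2_normalize(image_emb))

# Numerically stable softmax over the candidate labels.
probs = np.exp(logits - logits.max())
probs /= probs.sum()
print(dict(zip(labels, probs.round(3))))
```

The predicted class is simply the label with the highest probability; no task-specific fine-tuning is involved.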
## Taiyi CLIP Roberta 102M Chinese
The first open-source Chinese CLIP model, pre-trained on 123 million image-text pairs, with a text encoder based on the RoBERTa-base architecture.
Text-to-Image · Transformers · Chinese · Apache-2.0
IDEA-CCNL · 558 · 51
## Mengzi Oscar Base Retrieval
A Chinese image-text retrieval model fine-tuned on the COCO-ir dataset, based on the Chinese multimodal pretraining model Mengzi-Oscar.
Text-to-Image · Transformers · Chinese · Apache-2.0
Langboat · 17 · 3